PERF/BUG: reimplement MultiIndex.remove_unused_levels #16565

rhendric · 2017-06-01T01:40:30Z

Add a large random test case for remove_unused_levels that failed the
previous implementation
Fix PERF: remove_unused_levels is very slow #16556, a performance issue with the previous implementation
~~Add inplace functionality~~
Always return (if not inplace) at least a view instead of the original
index

codecov · 2017-06-01T02:18:13Z

Codecov Report

Merging #16565 into master will decrease coverage by 50.59%.
The diff coverage is 6.25%.

@@            Coverage Diff             @@
##           master   #16565      +/-   ##
==========================================
- Coverage   90.75%   40.16%   -50.6%     
==========================================
  Files         161      161              
  Lines       51073    51078       +5     
==========================================
- Hits        46350    20513   -25837     
- Misses       4723    30565   +25842

Flag	Coverage Δ
#multiple	`?`
#single	`40.16% <6.25%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/multi.py	`32.27% <6.25%> (-64.38%)`	⬇️
pandas/types/concat.py	`0% <0%> (-100%)`	⬇️
pandas/tools/hashing.py	`0% <0%> (-100%)`	⬇️
pandas/parser.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tslib.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-90.67%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.28%)`	⬇️
pandas/core/tools/numeric.py	`10.44% <0%> (-89.56%)`	⬇️
pandas/io/json/normalize.py	`8.16% <0%> (-88.78%)`	⬇️
... and 111 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee8346d...dbd9ddc. Read the comment docs.

codecov · 2017-06-01T02:18:14Z

Codecov Report

Merging #16565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16565      +/-   ##
==========================================
- Coverage   90.75%   90.75%   -0.01%     
==========================================
  Files         161      161              
  Lines       51095    51097       +2     
==========================================
+ Hits        46370    46371       +1     
- Misses       4725     4726       +1

Flag	Coverage Δ
#multiple	`88.59% <100%> (-0.01%)`	⬇️
#single	`40.16% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/multi.py	`96.65% <100%> (ø)`	⬆️
pandas/core/indexes/datetimes.py	`95.23% <0%> (-0.1%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db419bf...83bdc59. Read the comment docs.

jreback

looks good. pls add a 0.20.2 bug fix indexing (you can also say perf improvement there), whatsnew note. we r releasing tomorrow, so do today if you can.

jreback · 2017-06-01T10:14:58Z

pandas/core/indexes/multi.py

@@ -1263,6 +1263,11 @@ def remove_unused_levels(self):

        .. versionadded:: 0.20.0

+        Parameters


Indexes are immutable. We never use inplace. remove.

Um, are you sure? Index.set_names, MultiIndex.set_levels, and MultiIndex.set_labels already use inplace. And unlike the latter two functions, remove_unused_levels can't change the result of values, meaning it should be very safe to use in place since it won't change the semantics of any data structures using the index.

yes i am sure
those are terrible patterns for an immutable object and should be removed

jreback · 2017-06-01T10:21:33Z

pandas/core/indexes/multi.py

-        # nothing changed
-        if not changed.any():
-            return self
+        if inplace:


just use shallow_copy here

jreback · 2017-06-01T10:22:18Z

pandas/tests/indexes/test_multi.py

+        assert result2.is_(result)
+
+    def test_remove_unused_levels_large(self):
+        # because tests should be deterministic:


add the issue number as a comment

jreback · 2017-06-01T10:22:30Z

pandas/tests/indexes/test_multi.py

+
+    def test_remove_unused_levels_large(self):
+        # because tests should be deterministic:
+        rng = np.random.RandomState(4)  # chosen by fair dice roll.


don't use random state here

jreback · 2017-06-01T10:22:51Z

pandas/tests/indexes/test_multi.py

+        assert len(result.levels[1]) < len(df.index.levels[1])
+        assert result.equals(df.index)
+
+        # in place


jreback · 2017-06-01T10:23:22Z

pandas/tests/indexes/test_multi.py

+        result = df.index.remove_unused_levels()
+        assert len(result.levels[0]) < len(df.index.levels[0])
+        assert len(result.levels[1]) < len(df.index.levels[1])
+        assert result.equals(df.index)


also tests with different things for the levels, e.g. add another example with dates & strings

jreback · 2017-06-01T10:24:15Z

pandas/tests/indexes/test_multi.py

+        assert len(result.levels[0]) < len(df.index.levels[0])
+        assert len(result.levels[1]) < len(df.index.levels[1])
+        assert result.equals(df.index)
+


also construct that this IS equal to .reset_index(..).set_index(...).index. (for each example)

rhendric · 2017-06-01T16:40:14Z

@jreback, I removed RandomState from the tests as you asked, but the test now occasionally fails for me when I run pytest because of the non-determinism. (The levels are no longer guaranteed to be strictly smaller in the output.) Is this actually better? I'd rather not change the inequality checks to be <= because I want to assert that remove_unused_levels is doing something. Which is best: a weaker test, a test that has a small random chance of failing, or a test that uses RandomState?

(Oh look, it just failed like this in CI.)

rhendric · 2017-06-01T17:04:40Z

pandas/tests/indexes/test_multi.py

+            assert result.equals(df.index)
+
+            expected = df.reset_index().set_index(['first', 'second']).index
+            assert result.equals(expected)


I think I should replace this with the stronger tm.assert_index_equal(). I plan to do so when updating this test based on what you tell me to do about the non-determinism.

use parameterize instead of doing it this way: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#using-pytest

jreback · 2017-06-01T17:09:36Z

pandas/tests/indexes/test_multi.py

+            assert result.equals(df.index)
+
+            expected = df.reset_index().set_index(['first', 'second']).index
+            assert result.equals(expected)


use parameterize instead of doing it this way: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#using-pytest

jreback · 2017-06-01T17:11:23Z

@jreback, I removed RandomState from the tests as you asked, but the test now occasionally fails for me when I run pytest because of the non-determinism. (The levels are no longer guaranteed to be strictly smaller in the output.) Is this actually better? I'd rather not change the inequality checks to be <= because I want to assert that remove_unused_levels is doing something. Which is best: a weaker test, a test that has a small random chance of failing, or a test that uses RandomState?

oh that's fine. wasn't really sure why you were explicity setting the random. yes sometime you will not remove levels based on the filter, so ok putting it back.

jreback · 2017-06-01T22:16:39Z

@rhendric lgtm. can you add an asv for this (asv/benchmarks/indexing.py) is ok. you can use your test example (just the int/int is fine). and pls post the result.

* Add a large random test case for remove_unused_levels that failed the previous implementation * Fix pandas-dev#16556, a performance issue with the previous implementation * Always return at least a view instead of the original index

rhendric · 2017-06-02T00:02:22Z

I'm new to asv; what is the result you want me to post? Console output? A file somewhere?

jreback · 2017-06-02T00:07:11Z

implement a suitable asv. then you run asv continuous master yourbranch -b regexttoselectit

it should run and then post results. see http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-performance-test-suite

jreback · 2017-06-02T00:07:30Z

asv_bench/benchmarks/indexing.py

@@ -248,6 +254,9 @@ def time_multiindex_small_get_loc_warm(self):
    def time_is_monotonic(self):
        self.miint.is_monotonic

+    def time_remove_unused_levels(self):


rhendric · 2017-06-02T00:10:20Z

       before           after         ratio
     [fb47ee56]       [83bdc590]
-         185±2ms          846±2μs     0.00  indexing.MultiIndexing.time_remove_unused_levels

jreback · 2017-06-02T00:22:24Z

ok that's slightly better 😄

jreback · 2017-06-02T00:23:15Z

thanks!

(cherry picked from commit 8d092d9)

Version 0.20.2 * tag 'v0.20.2': (68 commits) RLS: v0.20.2 DOC: Update release.rst DOC: Whatsnew fixups (pandas-dev#16596) ERRR: Raise error in usecols when column doesn't exist but length matches (pandas-dev#16460) BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pandas-dev#16444) PERF: vectorize _interp_limit (pandas-dev#16592) DOC: whatsnew 0.20.2 edits (pandas-dev#16587) API: Make is_strictly_monotonic_* private (pandas-dev#16576) BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565) Strictly monotonic (pandas-dev#16555) ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pandas-dev#14026) fix linting BUG: Incorrect handling of rolling.cov with offset window (pandas-dev#16244) BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (pandas-dev#16317) return empty MultiIndex for symmetrical difference on equal MultiIndexes (pandas-dev#16486) BUG: Bug in .resample() and .groupby() when aggregating on integers (pandas-dev#16549) BUG: Fixed tput output on windows (pandas-dev#16496) Strictly monotonic (pandas-dev#16555) BUG: fixed wrong order of ordered labels in pd.cut() BUG: Fixed to_html ignoring index_names parameter ...

rhendric force-pushed the fix/MultiIndex.remove_unused_levels branch from 4231e0c to dbd9ddc Compare June 1, 2017 02:18

jreback requested changes Jun 1, 2017

View reviewed changes

jreback added Bug MultiIndex Performance Memory or execution speed performance labels Jun 1, 2017

jreback added this to the 0.20.2 milestone Jun 1, 2017

rhendric force-pushed the fix/MultiIndex.remove_unused_levels branch 2 times, most recently from 8a9fe43 to 62c5907 Compare June 1, 2017 15:17

rhendric commented Jun 1, 2017

View reviewed changes

jreback approved these changes Jun 1, 2017

View reviewed changes

rhendric force-pushed the fix/MultiIndex.remove_unused_levels branch from 62c5907 to 9852c6f Compare June 1, 2017 17:37

jreback added the Needs Backport label Jun 1, 2017

BUG: reimplement MultiIndex.remove_unused_levels

83bdc59

* Add a large random test case for remove_unused_levels that failed the previous implementation * Fix pandas-dev#16556, a performance issue with the previous implementation * Always return at least a view instead of the original index

rhendric force-pushed the fix/MultiIndex.remove_unused_levels branch from 9852c6f to 83bdc59 Compare June 2, 2017 00:03

jreback reviewed Jun 2, 2017

View reviewed changes

jreback changed the title ~~BUG: reimplement MultiIndex.remove_unused_levels~~ PERF/BUG: reimplement MultiIndex.remove_unused_levels Jun 2, 2017

jreback merged commit 8d092d9 into pandas-dev:master Jun 2, 2017

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 2, 2017

BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565)

f8e53b6

(cherry picked from commit 8d092d9)

TomAugspurger pushed a commit that referenced this pull request Jun 4, 2017

BUG: reimplement MultiIndex.remove_unused_levels (#16565)

1564a6e

(cherry picked from commit 8d092d9)

TomAugspurger removed the Needs Backport label Jun 4, 2017

Kiv pushed a commit to Kiv/pandas that referenced this pull request Jun 11, 2017

BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565)

ec535e9

stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017

BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565)

afb4f59

rhendric deleted the fix/MultiIndex.remove_unused_levels branch November 24, 2022 14:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF/BUG: reimplement MultiIndex.remove_unused_levels #16565

PERF/BUG: reimplement MultiIndex.remove_unused_levels #16565

rhendric commented Jun 1, 2017 •

edited

Loading

codecov bot commented Jun 1, 2017

codecov bot commented Jun 1, 2017 •

edited

Loading

jreback left a comment

jreback Jun 1, 2017

rhendric Jun 1, 2017

jreback Jun 1, 2017

jreback Jun 1, 2017

jreback Jun 1, 2017

jreback Jun 1, 2017

jreback Jun 1, 2017

jreback Jun 1, 2017

jreback Jun 1, 2017

rhendric commented Jun 1, 2017 •

edited

Loading

rhendric Jun 1, 2017

jreback Jun 1, 2017

jreback Jun 1, 2017

jreback commented Jun 1, 2017

jreback commented Jun 1, 2017

rhendric commented Jun 2, 2017

jreback commented Jun 2, 2017

jreback Jun 2, 2017

rhendric commented Jun 2, 2017

jreback commented Jun 2, 2017

jreback commented Jun 2, 2017

		@@ -1263,6 +1263,11 @@ def remove_unused_levels(self):

		.. versionadded:: 0.20.0

		Parameters

PERF/BUG: reimplement MultiIndex.remove_unused_levels #16565

PERF/BUG: reimplement MultiIndex.remove_unused_levels #16565

Conversation

rhendric commented Jun 1, 2017 • edited Loading

codecov bot commented Jun 1, 2017

Codecov Report

codecov bot commented Jun 1, 2017 • edited Loading

Codecov Report

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rhendric commented Jun 1, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jun 1, 2017

jreback commented Jun 1, 2017

rhendric commented Jun 2, 2017

jreback commented Jun 2, 2017

Choose a reason for hiding this comment

rhendric commented Jun 2, 2017

jreback commented Jun 2, 2017

jreback commented Jun 2, 2017

rhendric commented Jun 1, 2017 •

edited

Loading

codecov bot commented Jun 1, 2017 •

edited

Loading

rhendric commented Jun 1, 2017 •

edited

Loading